proper noun
NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
Ansari, Amirhossein, Wang, Ke, Xiong, Pulei
Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods can misclassify in-distribution samples as OOD when negative labels are subcategories of in-distribution labels or proper nouns. They also struggle with images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism that excludes subcategory labels and proper nouns from the negative label set, and by incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of the multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > Costa Rica (0.04)
- North America > Canada (0.04)
- (2 more...)
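Negative label-based scoring of the kind described above can be sketched with cosine similarities and a softmax over the combined in-distribution (ID) and negative label sets. The embeddings, temperature, and labels below are made up for illustration; this is not NegRefine's actual scoring function.

```python
import numpy as np

def negative_label_score(image_feat, id_feats, neg_feats, temp=0.01):
    """Share of softmax mass falling on ID labels when the image is
    scored against ID and negative labels together. High values
    suggest the image is in-distribution."""
    sims = np.concatenate([id_feats, neg_feats]) @ image_feat  # cosine sims
    logits = sims / temp
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[: len(id_feats)].sum())

# Toy unit-vector embeddings: two ID labels, one negative label.
id_feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
neg_feats = np.array([[0.0, 0.0, 1.0]])
```

An image embedding close to an ID label yields a score near 1; one close to a negative label yields a score near 0.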
Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
Bondok, Rawan, Nassar, Mayar, Khalifa, Salam, Micallef, Kurt, Habash, Nizar
Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun diacritization.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- (17 more...)
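The 73% figure above is an exact-match accuracy; the paper's precise scoring protocol isn't given here, but a plausible minimal metric counts a prediction as correct only when the fully diacritized form matches the reference exactly:

```python
def diacritization_accuracy(predictions, references):
    """Exact-match accuracy over (prediction, reference) string pairs."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```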
Extracting domain-specific terms using contextual word embeddings
Repar, Andraž, Lavrač, Nada, Pollak, Senja
Automated terminology extraction refers to the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction, which combines features from traditional term extraction systems with novel contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse a new term-annotated corpus RSDO5 for the Slovenian language and devise a set of rules for term candidate selection, and then generate statistical, linguistic and context-based features. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains (biomechanics, linguistics, chemistry, veterinary) of the RSDO5 corpus and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in terms of F1 score over the previous state of the art, which proves that contextual word embeddings are valuable for improving term extraction.

1. Introduction

Automated terminology extraction (ATE) refers to the task of extracting meaningful terms from domain-specific texts. Terms are single-word (SWU) or multi-word units (MWU) of knowledge, which are relevant for a particular domain. Since manual identification of terms is costly and time-consuming, ATE approaches can reduce the effort needed to generate relevant domain-specific terms. Recognizing and extracting domain-specific terms, which is useful in various fields such as translation, dictionary creation, ontology generation and others, remains a difficult task.
- Europe > Slovenia > Gorizia > Municipality of Vipava > Vipava (0.04)
- Europe > Slovenia > Gorizia > Municipality of Nova Gorica > Nova Gorica (0.04)
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- Asia > Malaysia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)
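The feature-combination idea above can be sketched with scikit-learn; the feature names, dimensions, and synthetic labels below are illustrative stand-ins, not the RSDO5 setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical per-candidate features: a few traditional statistical/linguistic
# scores (e.g. frequency, termhood) plus a contextual-embedding summary.
stat_feats = rng.normal(size=(60, 3))
ctx_feats = rng.normal(size=(60, 5))
X = np.hstack([stat_feats, ctx_feats])        # one row per term candidate
y = (X[:, 0] + X[:, 3] > 0).astype(int)       # synthetic term / non-term labels

clf = SVC(kernel="rbf").fit(X, y)             # binary term-candidate classifier
```

Concatenating both feature families lets the SVM weigh distributional context against classic termhood statistics instead of relying on either alone.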
METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth
Li, Jiawei, Xu, Xiaoang, Gao, Yang
Model evolution enables learning from feedback to refine experiences and update skills, transforming models from having no domain knowledge to becoming domain experts. However, there is currently no unified and effective method for guiding this evolutionary process. To address this gap, we propose the Meteor method, which includes three training phases: weak-to-strong data distillation, iterative training, and self-evolution strategies. Each phase maximizes the model's inherent domain capabilities, allowing it to autonomously refine its domain knowledge and enhance performance. Experiments demonstrate that our approach significantly improves accuracy, completeness, relevance, coherence, and reliability across domain-specific tasks.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- Asia > Singapore (0.04)
- (8 more...)
- Overview (0.93)
- Research Report > New Finding (0.46)
DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
Wang, Yutong, Zeng, Jiali, Liu, Xuebo, Wong, Derek F., Meng, Fandong, Zhou, Jie, Zhang, Min
Large language models (LLMs) have achieved reasonable quality improvements in machine translation (MT). However, most current research on MT-LLMs still faces significant challenges in maintaining translation consistency and accuracy when processing entire documents. In this paper, we introduce DelTA, a Document-levEL Translation Agent designed to overcome these limitations. DelTA features a multi-level memory structure that stores information across various granularities and spans, including Proper Noun Records, Bilingual Summary, Long-Term Memory, and Short-Term Memory, which are continuously retrieved and updated by auxiliary LLM-based components. Experimental results indicate that DelTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. DelTA employs a sentence-by-sentence translation strategy, ensuring no sentence omissions and offering a memory-efficient solution compared to the mainstream method. Furthermore, DelTA improves pronoun translation accuracy, and the summary component of the agent also shows promise as a tool for query-based summarization tasks. We release our code and data at https://github.com/YutongWang1216/DocMTAgent.
- Asia > Singapore (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (16 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
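The memory levels named in the DelTA abstract can be caricatured as a small container; the update policies below (first-seen proper-noun translations, a sliding short-term window) are simplified assumptions, not DelTA's actual components.

```python
from dataclasses import dataclass, field

@dataclass
class MultiLevelMemory:
    proper_nouns: dict = field(default_factory=dict)  # source form -> chosen translation
    bilingual_summary: str = ""
    long_term: list = field(default_factory=list)
    short_term: list = field(default_factory=list)

    def record_proper_noun(self, source, target):
        # First translation wins, so later sentences reuse it consistently.
        self.proper_nouns.setdefault(source, target)

    def push_sentence(self, src, tgt, window=5):
        # Keep the last `window` sentence pairs short-term; spill the rest.
        self.short_term.append((src, tgt))
        if len(self.short_term) > window:
            self.long_term.append(self.short_term.pop(0))
```

Keeping proper nouns in their own record is what enforces the consistency the abstract measures: every later sentence looks up the same translation instead of re-deciding it.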
Annotation Guidelines for Corpus Novelties: Part 1 -- Named Entity Recognition
Amalvy, Arthur, Labatut, Vincent
The Novelties corpus was constituted mainly to fulfill two goals: in the short term, to train and test NER methods able to handle long texts, and in the longer term, to be used to develop Renard [3], a pipeline aiming at extracting character networks from literary fiction. This pipeline includes several processing steps after the NER, including coreference resolution and character unification. Character networks can be used to tackle a number of tasks, including assessing literary theories, estimating the level of historicity of a narrative, detecting roles in stories, classifying novels, identifying subplots, segmenting a storyline, summarizing a story, designing recommendation systems, aligning narratives, and more. See the detailed survey of Labatut and Bost [11] for more information regarding character networks. This context drives the elaboration of the corpus, which explains why it exhibits certain differences with many similar NER corpora, such as CoNLL-2003 [17] or OntoNotes v5 [20]. We originally based Novelties on the literary corpus from Dekker et al. [6], as we describe in Section A of the appendix. Note that there are other literary NER corpora (cf.
- Europe > United Kingdom > England (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New York (0.04)
- (6 more...)
The Self-Contained Negation Test Set
Kletz, David, Amsili, Pascal, Candito, Marie
Several methodologies have recently been proposed to evaluate the ability of Pretrained Language Models (PLMs) to interpret negation. In this article, we build on Gubelmann and Handschuh (2022), which studies the modification of PLMs' predictions as a function of the polarity of inputs, in English. Crucially, this test uses ``self-contained'' inputs ending with a masked position: depending on the polarity of a verb in the input, a particular token is either semantically ruled out or allowed at the masked position. By replicating Gubelmann and Handschuh (2022) experiments, we have uncovered flaws that weaken the conclusions that can be drawn from this test. We thus propose an improved version, the Self-Contained Neg Test, which is more controlled, more systematic, and entirely based on examples forming minimal pairs varying only in the presence or absence of verbal negation in English. When applying our test to the roberta and bert base and large models, we show that only roberta-large shows trends that match the expectations, while bert-base is mostly insensitive to negation. For all the tested models though, in a significant number of test instances the top-1 prediction remains the token that is semantically forbidden by the context, which shows how much room for improvement remains for a proper treatment of the negation phenomenon.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Pennsylvania (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
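The test design above (self-contained minimal pairs ending in a masked position) can be sketched as a tiny harness; the template and the forbidden-token check below are illustrative, not the authors' actual materials.

```python
def minimal_pair(prefix, verb, suffix):
    """Two inputs differing only in the presence of verbal negation,
    each ending at a masked position."""
    pos = f"{prefix} did {verb} {suffix} [MASK]."
    neg = f"{prefix} did not {verb} {suffix} [MASK]."
    return pos, neg

def forbidden_top1_rate(top1_predictions, forbidden_tokens):
    """Share of instances whose top-1 prediction is the token the
    context semantically rules out (lower is better)."""
    hits = sum(p == f for p, f in zip(top1_predictions, forbidden_tokens))
    return hits / len(forbidden_tokens)
```

A negation-sensitive model should almost never place the forbidden token at top-1 in the negated member of a pair; the abstract reports that current PLMs still do so in a significant number of instances.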
Development of an AI Anti-Bullying System Using Large Language Model Key Topic Detection
Tassava, Matthew, Kolodjski, Cameron, Milbrath, Jordan, Bishop, Adorah, Flanders, Nathan, Fetsch, Robbie, Hanson, Danielle, Straub, Jeremy
Cyberbullying has become a pronounced problem due to the increasing ubiquity of online platforms that provide a means to conduct it. A significant amount of this cyberbullying is conducted by and targets teenagers. It is difficult for teenage students to shut themselves off from the digital world in which the cyberbullying is taking place. Given how entrenched the use of digital apps is among today's youth, and the pronounced consequences of cyberbullying, including victim self-harm in some cases, it is at least as much of a threat as physical bullying. Additionally, because of the obfuscation caused by the online environment, authorities (such as parents, teachers and law enforcement) may have difficulty determining what has occurred and who the participating actors are.
- Africa (0.04)
- Oceania > New Zealand (0.04)
- Europe > Italy > Tuscany (0.04)
- (9 more...)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Education > Educational Setting > K-12 Education (0.45)
XSTEM: An exemplar-based stemming algorithm
Stemming is the process of reducing related words to a standard form by removing affixes from them. For example, eating, eats, and eaten can be reduced to the standard form eat by removing the -ing, -s, and -en from each. Stemming is a foundational step in many text processing pipelines, including information retrieval and language modeling [2], and as a feature reduction step for classification or lexical transfer learning tasks [1]. Most stemming algorithms are either rule-based or corpus-based. Rule-based stemmers use a set of manually created rules to transform words to their base form, typically in conjunction with reference to a dictionary for handling exceptions or modulating their output in some fashion. Corpus-based stemmers typically employ statistical machine learning methods to equate word forms based on distributional regularities derived from large amounts of text. Although researchers have demonstrated impressive results with corpus-based approaches (e.g., [2, and references therein]), rule-based stemming implementations […]

I was introduced to exemplar theory by Keith Johnson in a phonetics seminar at Ohio State. His model is called XMOD, which, as he notes, is the best name [7]. The name XSTEM comes from that.
- North America > United States > Ohio (0.24)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Maldives (0.04)
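XSTEM itself is exemplar-based, but the -ing/-s/-en example in the passage is easy to render as the kind of naive rule-based suffix stripping it contrasts with; the suffix list and minimum-stem-length guard below are assumptions for illustration.

```python
def strip_suffix(word, suffixes=("ing", "en", "s")):
    """Naive rule-based stemming: strip the first matching suffix,
    longest first, leaving a stem of at least three letters."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

This reduces eating, eats, and eaten to eat, exactly as in the opening example; real rule-based stemmers add many more rules plus a dictionary for exceptions.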
Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification
Staufer, Dimitri, Pallas, Frank, Berendt, Bettina
Whistleblowing is essential for ensuring transparency and accountability in both public and private sectors. However, (potential) whistleblowers often fear or face retaliation, even when reporting anonymously. The specific content of their disclosures and their distinct writing style may re-identify them as the source. Legal measures, such as the EU WBD, are limited in their scope and effectiveness. Therefore, computational methods to prevent re-identification are important complementary tools for encouraging whistleblowers to come forward. However, current text sanitization tools follow a one-size-fits-all approach and take an overly limited view of anonymity. They aim to mitigate identification risk by replacing typical high-risk words (such as person names and other NE labels) and combinations thereof with placeholders. Such an approach, however, is inadequate for the whistleblowing scenario since it neglects further re-identification potential in textual features, including writing style. Therefore, we propose, implement, and evaluate a novel classification and mitigation strategy for rewriting texts that involves the whistleblower in the assessment of the risk and utility. Our prototypical tool semi-automatically evaluates risk at the word/term level and applies risk-adapted anonymization techniques to produce a grammatically disjointed yet appropriately sanitized text. We then use a LLM that we fine-tuned for paraphrasing to render this text coherent and style-neutral. We evaluate our tool's effectiveness using court cases from the ECHR and excerpts from a real-world whistleblower testimony and measure the protection against authorship attribution (AA) attacks and utility loss statistically using the popular IMDb62 movie reviews dataset. Our method can significantly reduce AA accuracy from 98.81% to 31.22%, while preserving up to 73.1% of the original content's semantics.
- North America > United States (1.00)
- Europe > Austria > Vienna (0.14)
- North America > Montserrat (0.04)
- (12 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (0.93)
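The word/term-level, risk-adapted replacement step described above can be sketched as follows; the risk scores are assumed to come from an upstream classifier, and the threshold and placeholder are illustrative (the described tool additionally runs a fine-tuned paraphrasing LLM over the result to restore coherence).

```python
def sanitize(tokens, risk_scores, threshold=0.5, placeholder="[REDACTED]"):
    """Replace tokens whose estimated re-identification risk exceeds
    the threshold; everything else passes through unchanged."""
    return [
        placeholder if risk_scores.get(tok, 0.0) > threshold else tok
        for tok in tokens
    ]
```

Raising the threshold trades protection for utility, which is the risk/utility balance the tool lets the whistleblower assess interactively.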